BMC Medical Informatics and Decision Making — Latest Matching Preprints

1

Towards clinical implementation of artificial intelligence in cancer care: Concept mapping analysis of provincial workshop findings

Nayyar, C.; Xu, H. H.; Bates, A. T.; Conati, C.; Hilbers, D.; Avery, J.; Raman, S.; Fayaz-Bakhsh, A.; Nunez, J.-J.

2026-03-27 health systems and quality improvement 10.64898/2026.03.26.26349205 medRxiv

Top 0.1%

28.5%

Background: Artificial intelligence (AI) has rapidly garnered interest in healthcare, with research showing promise to improve quality, efficiency, and outcomes. Cancer care, with its multidisciplinary nature and high coordination demands, is well positioned to benefit from AI. While attitudes toward the uptake of evidence and the implementation of AI in medicine have been explored generally, literature remains scarce specifically regarding AI in cancer care. This study sought to understand how perspectives of both patients and professionals are essential for guiding responsible, effective implementation of evidence-based (EB) AI in cancer care. Methods: We conducted a workshop at the 2024 British Columbia (BC) Cancer Summit (Vancouver, Canada). Discussions addressed three guiding questions: concerns, benefits, and priorities for AI in cancer care. Responses from 48 workshop participants (patients and families, AI/computer science/cancer researchers, clinicians and allied health professionals, information technology professionals, healthcare administrators) underwent structured conceptualization by concept mapping, leveraging multidimensional scaling and hierarchical cluster and subcluster analysis to produce visual and quantitative maps of stakeholder priorities. Results: A total of 265 statements on perceived benefits, concerns, and priorities related to the implementation of AI in cancer care were generated from the workshop and underwent concept mapping. Two clusters were identified: Cluster 1 focused on "Challenges and Safeguards for AI Implementation," and Cluster 2 focused on "Clinical Benefits and Efficiency Gains." Subcluster analysis distinguished 8 thematic subclusters (4 per cluster). Both mean importance (P < .001) and feasibility (P < .001) ratings were significantly higher for Cluster 2. No differences were found between ratings by clinical and nonclinical professionals.
Further go-zone analysis classified statements according to their relative superiority/inferiority in importance and feasibility compared to the overall average. Conclusions: Stakeholder ratings were higher for statements describing clinical benefits and efficiency gains than for those describing challenges and safeguards for AI implementation in cancer care. Concept mapping analysis distinguished between workflow-aligned AI applications, perceived as ready for implementation, and system-level governance requirements requiring longer-term investment. Present findings provide a structured, stakeholder-informed framework for prioritizing and sequencing AI implementation efforts in cancer care, constituting a practical blueprint to catalyze meaningful progress.

2

A bibliometric review of explainable AI in diabetes risk prediction: Trends, gaps, and knowledge graph opportunities

Van, T. A.

2026-04-20 health informatics 10.64898/2026.04.16.26351069 medRxiv

Top 0.1%

13.8%

Background: Type 2 diabetes mellitus (T2DM) is a leading global public health challenge. Machine learning (ML) combined with Explainable AI (XAI) is increasingly applied to T2DM risk prediction, but the field lacks a quantitative overview of methodological trends and integration gaps. Methods: We present a structured synthesis and critical analysis of the XAI literature on T2DM risk prediction, combining (i) quantitative bibliometric analysis of a two-database corpus (N = 2,048 documents from Scopus and PubMed/MEDLINE, deduplicated via a transparent three-tier pipeline) and (ii) an in-depth selective review of 15 highly cited papers. Reporting follows PRISMA 2020, adapted for metadata-based synthesis; analyses include keyword frequency, rule-based thematic clustering, and publication trend analysis. Results: The field grew rapidly, from 36 documents (2020) to 866 (2025). SHAP and LIME dominate XAI methods; XGBoost and Random Forest dominate ML models. Critically, KG/GNN terms appeared in only 17 documents (~0.83%) compared with 906 for XAI methods, a 53.3:1 disparity. This gap is consistent across both databases, which share 33.2% of their records, ruling out a single-database artifact. The selective review confirmed that none of the 15 highly cited papers combined all three components, ML, XAI, and KG, in T2DM risk prediction. Conclusions: The XAI for T2DM risk prediction field exhibits a clinical interpretability gap: statistical explanations are rarely linked to structured clinical pathways. We propose a three-layer conceptual framework (Predictive → Explainability → Knowledge) that integrates KG as a supplementary semantic layer, with potential applications in clinical decision support and population-level screening. The framework does not perform true causal inference but structures explanations around established pathophysiological knowledge.
This study contributes a transferable methodology and a quantified research gap to guide future work integrating ML, XAI, and structured medical knowledge.
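
The two headline figures in this abstract can be reproduced from the counts it reports. The following is a minimal Python sketch, assuming the 0.83% share is taken over the full 2,048-document corpus and the disparity is the plain ratio of the two document counts; the variable names are illustrative and do not come from the paper.

```python
# Counts as reported in the abstract.
corpus_size = 2048   # deduplicated Scopus + PubMed/MEDLINE documents
kg_gnn_docs = 17     # documents mentioning KG/GNN terms
xai_docs = 906       # documents mentioning XAI methods

# Assumption: the share is computed against the whole corpus.
kg_gnn_share_pct = 100 * kg_gnn_docs / corpus_size

# Assumption: the disparity is the simple ratio of document counts.
disparity = xai_docs / kg_gnn_docs

print(f"KG/GNN share: {kg_gnn_share_pct:.2f}%")
print(f"XAI to KG/GNN ratio: {disparity:.1f}:1")
```

Under these assumptions the computed values round to the reported ~0.83% and 53.3:1.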

3

The Evolution and Equity of China's Pharmacist Workforce in Healthcare Institutions: A Provincial Panel Data Analysis, 2007-2023

Xia, Y.; Sun, L.; Zhao, Y.

2026-04-23 health policy 10.64898/2026.04.22.26351514 medRxiv

Top 0.2%

10.5%

Background: China has implemented policies to strengthen its pharmacist workforce since the 2009 healthcare reform, yet a comprehensive evaluation of their long-term systemic effects is lacking. Objective: To systematically analyze the evolution of China's pharmacist workforce in healthcare institutions from 2007 to 2023 across four dimensions: quantity, quality, structure, and distribution, providing an empirical foundation for policy optimization. Methods: A retrospective analysis was conducted using longitudinal data from the China Health Statistics Yearbooks. Trends were delineated via descriptive statistics. Equity and spatial evolution were assessed using the Gini coefficient, Theil index decomposition, and spatial autocorrelation analyses (Global Moran's I and hotspot analysis). Results: From 2007 to 2023, the total number of pharmacists increased from 357,700 to 569,500 (average annual growth: 2.2%). This growth lagged behind physicians (4.6%) and nurses (7.4%), causing the pharmacist-to-physician ratio to decline from 1:5.15 to 1:8.39. The workforce showed trends of feminization (female proportion rose from 59.7% to 70.8%) and aging. While quality improved, 51.1% still held an associate degree or below, and only 6.6% held senior titles. Equity analysis revealed the provincial Gini coefficient improved from 0.145 to 0.093. Theil index decomposition confirmed intra-provincial disparities as the primary inequality driver. Spatial analysis showed a non-significant global Moran's I by 2023 (0.154, P>0.05), down from 0.254 (P<0.01) in 2007. Hotspot analysis confirmed this transition, revealing a contraction of high-confidence clusters and a trend toward balanced distribution. Conclusions: China has made measurable progress in expanding pharmacist workforce size and improving inter-provincial equity since 2007.
However, persistent structural challenges remain: relative workforce contraction compared to other health professions, an aging demographic, a shortage of senior talent, and significant intra-provincial inequity. Future policies must prioritize optimizing workforce structure and enhancing clinical service capabilities to catalyze a shift toward patient-centered pharmaceutical care.
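
For readers unfamiliar with the equity measure named in this abstract, the sketch below shows one common way to compute a Gini coefficient in Python. It is an unweighted, rank-based version given for illustration only: the abstract does not state the exact formula the authors used (for example, whether provinces were population-weighted), and the `density` values are hypothetical, not data from the study.

```python
def gini(values):
    """Unweighted Gini coefficient of non-negative values (0 = perfect equality)."""
    xs = sorted(values)
    n = len(xs)
    total = sum(xs)
    if n == 0 or total == 0:
        raise ValueError("need at least one positive value")
    # Rank-weighted sum over the sorted values (ranks start at 1).
    ranked = sum(rank * x for rank, x in enumerate(xs, start=1))
    return 2 * ranked / (n * total) - (n + 1) / n


# Hypothetical pharmacist densities for five provinces (illustration only).
density = [3.1, 3.6, 4.0, 4.4, 5.2]
print(round(gini(density), 3))
```

A value closer to 0 indicates a more even distribution, which is the direction of the reported change from 0.145 to 0.093.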

4

Identification of Suicide-Related Subgroups Using Latent Class Analysis: Complementary Insights to Explainable AI-Based Classification

Kizilaslan, B.; Mehlum, L.

2026-03-27 psychiatry and clinical psychology 10.64898/2026.03.25.26349264 medRxiv

Top 0.2%

10.1%

Purpose: Suicide and self-harm are major public health concerns characterized by substantial clinical and psychosocial heterogeneity. While latent class analysis has been used to identify subgroups of people with suicidal behavior, the extent to which such population-level phenotyping complements explainable artificial intelligence-based classification models remains unclear. Methods: We applied latent class analysis to a cross-sectional, publicly available dataset of 1000 individuals presenting with self-harm and suicide-related behaviors at Colombo South Teaching Hospital, Kalubowila, Sri Lanka. Sociodemographic, psychosocial, and clinical variables were used to identify latent subgroups. Class characteristics and suicide prevalence were examined and compared with variable importance patterns reported in a previously published explainable artificial intelligence (XAI)-based suicide classification study using the same dataset. Results: Four latent classes were identified. Two classes exhibited very high suicide prevalence (91.2% [95% CI: 87.7-93.8] and 99.0% [95% CI: 96.4-99.7]), whereas two classes showed low prevalence (<1%). The two high-prevalence classes differed markedly in lifetime psychiatric hospitalization history, with one class showing a 100% prevalence of prior hospitalization and the other substantially lower hospitalization rates. These patterns partially aligned with, and extended beyond, variable importance findings from the XAI-based model. Conclusion: Latent class analysis identified distinct subgroups with substantially different suicide prevalence and clinical profiles, underscoring the heterogeneity of individuals presenting with self-harm. Comparison with the findings of the XAI-based suicide classification model suggests that unsupervised phenotyping and supervised classification provide complementary perspectives, offering population-level context that may enhance the interpretability of suicide assessment frameworks.
Keywords: suicide; self-harm; latent class analysis; explainable artificial intelligence; machine learning

5

ECG spectrogram-based deep learning model to predict deterioration of patients with early sepsis at the emergency department: a study from the Acutelines data- and biobank

van Wijk, R. J.; Schoonhoven, A. D.; de Vree, L.; Ter Horst, S.; Gaidhane, C.; Alcaraz, J. M. L.; Strodthoff, N.; ter Maaten, J. C.; Bouma, H. R.; Li, J.

2026-03-27 emergency medicine 10.64898/2026.03.26.26349371 medRxiv

Top 0.3%

9.3%

Purpose: Early recognition of deterioration in patients with suspected infection at the emergency department (ED) is important. Current clinical scoring systems show limited discriminative performance for early deterioration. Continuous electrocardiogram (ECG) recordings may offer additional dynamic physiological information that can enhance early prediction of deterioration in patients with suspected infection. Methods: We developed a multimodal, ECG-derived spectrogram-based pipeline to predict deterioration within 48 hours of ED admission. We used the first 20 minutes of ECG recordings for the spectrograms. We compared the model with the National Early Warning Score (NEWS), quick Sequential Organ Failure Assessment (qSOFA), a baseline model with vital parameters, sex, and age, and a Heart Rate Variability (HRV)-derived model. Results: In this study, 1321 patients were included, of whom 159 (12%) deteriorated. The multimodal model combining baseline data with spectrograms showed the best overall performance, with an area under the receiver operating characteristic curve (AUROC) of 0.788, followed by the baseline model (age, sex, triage vitals) alone, with an AUROC of 0.730. The HRV-only model and the qSOFA showed the lowest performance (AUROC 0.585 and 0.693, respectively). Conclusion: This study shows that ECG-derived multimodal spectrogram models outperform those based solely on vital signs and HRV features, as well as established clinical scores such as NEWS and qSOFA. Spectrogram analysis represents a promising approach to enhance early risk stratification and support clinical decision-making for patients with suspected infection in the ED.

6

Nationwide Prediction of Missed and Cancelled Appointments Using Real-World EHR Data

Miran, S. A.; Cheng, Y.; Faselis, C.; Brandt, C.; Vasaitis, S.; Nesbitt, L.; Zanin, L.; Tekle, S.; Ahmed, A.; Nelson, S. J.; Zeng-Treitler, Q.

2026-04-13 health informatics 10.64898/2026.04.08.26349942 medRxiv

Top 0.3%

9.3%

Objectives: To develop and evaluate predictive models for unused outpatient appointments (missed or cancelled) using a large national electronic health record (EHR) repository in the United States. Design: Retrospective observational study using machine learning and statistical modeling. Setting: A U.S. national electronic health record repository (Cerner Real World Database) covering healthcare encounters from 2010 to 2025. Participants: Adult patients aged ≥18 years with routine outpatient encounters recorded in the database. One outpatient appointment with a known status was randomly selected per patient, resulting in a final analytic sample of 5,699,861 encounters. Primary and Secondary Outcome Measures: The primary outcome was whether the index outpatient appointment was attended or unused (missed or cancelled). Model performance was evaluated using area under the receiver operating characteristic curve (AUC), sensitivity, and specificity. Methods: Predictors included patient characteristics (demographics and insurance type), appointment characteristics (day, time, season, and urbanicity), prior cancellation rate, and time gap between the index appointment and the previous visit. We compared the predictive performance of two machine learning models (random forest classifier and extreme gradient boosting (XGBoost)) with logistic regression. An explainable AI analysis of feature impact was performed on the final XGBoost model. Results: Among 5,699,861 outpatient encounters, 3,650,715 (64.0%) were attended and 2,049,146 (36.0%) were unused. XGBoost achieved the best predictive performance on the test dataset (AUC = 0.95), followed by random forest (AUC = 0.92) and logistic regression (AUC = 0.89). Feature impact score analysis revealed highly non-linear associations between predictors and the risk of unused appointments at the individual level. Conclusions: Unused outpatient appointments can be accurately predicted using routinely available EHR data.
Integrating predictive models into scheduling workflows may improve healthcare efficiency and optimize appointment management. Article Summary: Strengths and limitations of this study
- This study used one of the largest national electronic health record datasets to develop predictive models for unused outpatient appointments.
- Multiple modeling approaches, including logistic regression and machine learning methods (random forest and XGBoost), were compared to evaluate predictive performance.
- An explainable artificial intelligence method was applied to quantify feature impact and improve model interpretability.
- The retrospective design and reliance on routinely collected EHR data may introduce data quality limitations and unmeasured confounding.
- The database did not distinguish clearly between cancelled appointments and no-shows.
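
The outcome split reported in this abstract can be checked directly from its counts. The short Python sketch below only re-derives the total and the two percentages; the variable names are illustrative and nothing here reproduces the authors' modeling.

```python
# Encounter counts as reported in the abstract.
attended = 3_650_715
unused = 2_049_146   # missed or cancelled

total = attended + unused
attended_pct = 100 * attended / total
unused_pct = 100 * unused / total

print(total)
print(f"{attended_pct:.1f}% attended, {unused_pct:.1f}% unused")
```

The two counts sum to the stated analytic sample of 5,699,861 encounters, and the shares round to the reported 64.0% and 36.0%.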

7

Longitudinal information extraction from clinical notes in rare diseases: an efficient approach with small language models

Wang, X.; Faviez, C.; Vincent, M.; Andrew, J. J.; Le Priol, E.; Saunier, S.; Knebelmann, B.; Zhang, R.; Garcelon, N.; Burgun, A.; Chen, X.

2026-03-31 health informatics 10.64898/2026.03.30.26349388 medRxiv

Top 0.3%

8.5%

Objectives: Rare diseases often require longitudinal monitoring to characterise progression, yet much clinical information remains locked in unstructured electronic health records (EHRs). Efficient recovery of such data is critical for accurate prognostic modelling and clinical trial preparation. We aimed to develop and evaluate a small language model (SLM)-based pipeline for extracting longitudinal information from French clinical notes of patients with rare kidney diseases. Methods: As a use case, we focused on serum creatinine, a key biomarker of kidney function. We analyzed 81 clinical notes comprising 200 measurements (triplets of date, value, and unit). Four open-source SLMs (Mistral-7B, Llama-3.2-3B, Qwen3-4B, Qwen3-8B) were systematically tested with different prompting strategies in French and English. Outputs were post-processed to standardize formats and resolve inconsistencies, and performance was assessed across model size, prompting, language, and robustness to text duplication. Results: All SLMs extracted structured triplets, with F1-scores ranging from 0.519 to 0.928 (Qwen3-8B), outperforming the rule-based baseline. Larger models generally performed better, while prompting strategy and language had modest effects across models. SLMs also showed variable robustness to duplicated content common in real-world EHR notes. Discussion: Lightweight, locally deployable language models can accurately extract longitudinal biomarkers from unstructured clinical notes. Our findings highlight their practicality for rare diseases where data scarcity often limits task-specific model training. Conclusion: SLMs provide a privacy-preserving and resource-efficient solution for recovering longitudinal biomarker trajectories from unstructured notes, offering potential to advance real-world research and patient care in rare kidney diseases.

8

Performance of open-source large language models on nephrology self-assessment program

Ahangaran, M.; Jia, S.; Chitalia, S.; Athavale, A.; Francis, J. M.; O'Donnell, M. W.; Bavi, S. R.; Gupta, U. D.; Kolachalama, V. B.

2026-04-16 nephrology 10.64898/2026.04.16.26348910 medRxiv

Top 0.3%

8.4%

Background: Large Language Models (LLMs) have demonstrated strong performance in medical question-answering tasks, highlighting their potential for clinical decision support and medical education. However, their effectiveness in subspecialty areas such as nephrology remains underexplored. In this study, we assess the performance of open-source LLMs in answering multiple-choice questions from the Nephrology Self-Assessment Program (NephSAP) to better understand their capabilities and limitations within this specialized clinical domain. Methods: We evaluated the performance of five open-source LLMs: PodGPT, a podcast-pretrained model focused on STEMM disciplines; Llama 3.2-11B; Mistral-7B-Instruct-v0.2; Falcon3-10B-Instruct; and Gemma-2-9B-it. Each model was tested on its ability to answer multiple-choice questions derived from the NephSAP. Model performance was quantified using accuracy, defined as the proportion of correctly answered questions. In addition, the quality of the models' explanatory responses was assessed using several natural language processing (NLP) metrics: Bilingual Evaluation Understudy (BLEU), Word Error Rate (WER), cosine similarity, and Flesch-Kincaid Grade Level (FKGL). For qualitative analysis, three board-certified nephrologists reviewed 40 randomly selected model responses to identify factual and clinical reasoning errors, with performance summarized as average error ratios based on the proportion of error-associated words per response. Results: Among the evaluated models, PodGPT achieved the highest accuracy (64.77%), whereas Llama showed the lowest performance with an accuracy of 45.08%. Qualitative analysis showed that PodGPT had the lowest factual error rate (0.017), while Llama and Falcon achieved the lowest reasoning error rates (0.038).
Conclusions: This study highlights the importance of STEMM-based training to enhance the reasoning capabilities and reliability of LLMs in clinical contexts, supporting the development of more effective AI-driven decision-support tools in nephrology and other medical specialties.

9

Developing a Tiered Machine Learning Alert System for Real-Time Suicide Risk Detection in a Digital Mental Health Setting

Donegan, M. L.; Srivastava, A.; Peake, E.; Swirbul, M.; Ungashe, A.; Rodio, M. J.; Tal, N.; Margolin, G.; Benders-Hadi, N.; Padmanabhan, A.

2026-03-30 psychiatry and clinical psychology 10.64898/2026.03.26.26349452 medRxiv

Top 0.3%

8.4%

The goal of this work was to leverage a large corpus of text-based psychotherapy data to create novel machine learning algorithms that can identify suicide risk in asynchronous text therapy. Advances in the field of natural language processing and machine learning have allowed us to include novel data sources as well as use encoding models that can represent context. Our models utilize advanced natural language processing techniques, including fine-tuned transformer models like RoBERTa, to classify risk. Subsequent model versions incorporated non-text data such as demographic features and census-derived social determinants of health to improve equitable and culturally responsive risk assessment, as well as multiclass models that can identify tiered levels of risk. All new models demonstrated significant improvements over our previous model. Our final version, a multiclass model, provides a tiered system that classifies risk as "no risk," "moderate," or "severe" (weighted F1 of 0.85). This tiered approach enhances clinical utility by allowing providers to quickly prioritize the most urgent cases, ensuring more accurate and timely intervention for clients in need.

10

AI Implementation in Safety Net Healthcare: Understanding Barriers and Strategies

Thomas, C.; Kim, J. Y.; Hasan, A.; Kpodzro, S.; Cortes, J.; Day, B.; Jensen, S.; L'Huillier, S.; Oden, M. O.; Zumbado Segura, S.; Maurer, E. W.; Tucker, S.; Robinson, S.; Garcia, B.; Muramalla, E.; Lu, S.; Chawla, N.; Patel, M.; Balu, S.; Sendak, M.

2026-04-11 health systems and quality improvement 10.64898/2026.04.07.26350351 medRxiv

Top 0.3%

8.4%

Safety net healthcare delivery organizations (SNOs) serve vulnerable populations but face persistent challenges in adopting new technologies, including AI. While systematic barriers to technology adoption in SNOs are well documented, little is known about how AI is implemented in these settings. This study explored real-world AI adoption in SNOs, focusing on identifying barriers encountered across the AI lifecycle and strategies used to overcome them. Five SNOs in the U.S. participated in a 12-month technical assistance program, the Practice Network, to implement AI tools of their choosing. Observed barriers and mitigation strategies were documented throughout program activities and, at the conclusion of the program, reviewed and refined with participants using a participatory research approach to ensure findings reflected lived experiences and organizational contexts. Key barriers emerged during the Integration and Lifecycle Management phases and included gaps in AI performance evaluation and impact assessments, communication with patients about AI use, foundational AI education, financial resources for purchasing and maintaining AI tools, and AI governance structures. Effective strategies for addressing these barriers were primarily supported through centralized expertise, structured guidance, and peer learning. These findings provide granular, actionable insights for SNO leaders, offering guidance for anticipating barriers and proactively planning mitigation strategies. By including SNO perspectives, the study also contributes to the broader health AI ecosystem and underscores the importance of participatory, collaborative approaches to support safe, effective, and ethical AI adoption in resource-constrained settings. Author Summary: Safety net organizations (SNOs) are healthcare systems that primarily serve low-income and underinsured patients.
While interest in artificial intelligence (AI) in healthcare has grown rapidly, little is known about how these organizations experience AI adoption in practice. In this study, we partnered with five SNOs over a 12-month program to document the challenges they encountered when implementing AI tools and the strategies they used to address them. We worked closely with SNO staff throughout the process to ensure our findings reflected their lived experiences with AI implementation. We found that the most common challenges arose when organizations tried to integrate AI into daily operations and monitor and maintain those tools over time. Specific barriers included difficulty evaluating whether AI was performing as expected, limited guidance on communicating with patients about AI use, a lack of resources for staff training, limited financial resources, and the absence of formal governance structures. Successful strategies for overcoming these challenges drew on shared knowledge and structured support provided by the program, as well as learning from peer organizations. These findings offer practical guidance for SNO leaders planning or managing AI adoption, and contribute to a broader conversation about what is required to implement AI safely and effectively in healthcare settings that serve the most medically and socially vulnerable patients.

11

Governance, Accountability and Post-Deployment Monitoring Preferences for AI Integration in West African Clinical Practice: A Mixed-Methods Study

Uzochukwu, B. S. C.; Cherima, Y. J.; Enebeli, U. U.; Okeke, C. C.; Uzochukwu, A. C.; Omoha, A.; Hassan, B.; Eronu, E. M.; Yusuf, S. M.; Uzochukwu, K. A.; Kalu, E. I.

2026-04-01 health informatics 10.64898/2026.03.30.26349782 medRxiv

Top 0.3%

8.3%

Background: The integration of artificial intelligence (AI) into clinical practice holds transformative potential for healthcare in West Africa, but safe deployment requires context-appropriate governance, accountability, and post-deployment monitoring frameworks. This cross-sectional mixed-methods study examined preferences and concerns of West African clinicians and technical experts regarding AI governance structures, post-deployment surveillance mechanisms, and accountability allocation. Methods: A structured questionnaire was administered to 136 physicians affiliated with the West African College of Physicians (February 22-28, 2026), complemented by 72 key informant interviews with technical leads, AI developers, data scientists, policymakers, and healthcare leaders. Data were analyzed using descriptive statistics, inferential tests, and thematic analysis. Results: Clinicians strongly preferred independent regulatory bodies (40.4%) for overseeing AI tool performance, with high trust ratings (mean: 4.3/5), while vendor self-monitoring received minimal support (3.7%, mean: 2.4/5). Real-time dashboards were the most favored monitoring approach (41.9%). Clear accountability pathways (94.1%), algorithm transparency (91.9%), and real-time performance data (89.7%) were rated essential by majorities. Major concerns included clinicians being unfairly blamed for AI errors (76.5%), excessive vendor control (72.8%), and absence of clear reporting pathways (69.9%). Qualitative findings emphasized continuous performance tracking for accuracy, fairness, and bias; structured incident reporting; protocols for model drift and failure; and multi-layered governance combining independent oversight, institutional AI committees, and explicit liability frameworks. Conclusion: This study provides the first empirical evidence from West Africa on clinician preferences for AI governance.
Findings offer actionable guidance for policymakers to build trustworthy, equitable, and safe AI integration frameworks that prioritize transparency, independent oversight, and clinician protection. Keywords: artificial intelligence; AI governance; post-deployment monitoring; accountability; West Africa; clinician preferences; health data science.

12

Large language models and retrieval augmented generation for complex clinical codelists: evaluating performance and assessing failure modes

Matthewman, J.; Denaxas, S.; Langan, S.; Painter, J. L.; Bate, A.

2026-04-24 health informatics 10.64898/2026.04.23.26351098 medRxiv

Top 0.3%

7.0%

Objectives: Large language models (LLMs) have shown promise in creating clinical codelists for research purposes, a time-consuming task requiring expert domain knowledge. Here, we evaluate the performance and assess failure modes of a retrieval augmented generation (RAG) approach to creating clinical codelists for the large and complex medical terminology used by the Clinical Practice Research Datalink (CPRD). Materials & Methods: We set up a RAG system using a database of word embeddings of the medical terminology that we created using a general-purpose word embedding model (gemini-embedding). We developed 7 reference codelists presenting different challenges and tagged required and optional codes. We ran 168 evaluations (7 codelists, 2 different database subsets, 4 models, 3 epochs each). Scoring was based on the omission of required codes and the inclusion of irrelevant codes. We used model-grading (i.e., grading by another LLM with the reference codelists provided as context) to evaluate the output codelists (a score of 0% being all incorrect and 100% being all correct). Results: We saw varying accuracy across models and codelists, with Gemini 3 Pro (score 43%) generally performing better than Claude Sonnet 4.6 (36%) and Gemini 3 Flash, and OpenAI GPT 5.2 performing worst (14%). Models performed better with shorter target codelists (e.g., Eosinophilic esophagitis with four codes, and Hidradenitis suppurativa with 14 codes). In contrast, all models consistently failed to produce a complete Wrist fracture codelist (with 214 required codes). We further present evaluation summaries and failure mode evaluations produced by parsing LLM chat logs. Discussion: Besides demonstrating that a single-shot RAG approach is currently not suitable for codelist generation, we demonstrate failure modes including hallucinations, retrieval failures, and generation failures where retrieved codes are not used.
Conclusions: Our findings suggest that while RAG systems using current frontier LLMs may create correct clinical codelists in some cases, they still struggle with large and complex terminologies and codelists with a large number of codes. The failure modes we highlight can inform the creation of future workflows to avoid failures.
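
The evaluation count in this abstract follows from a full factorial design. The Python sketch below enumerates such a grid, assuming every combination of codelist, database subset, model, and epoch was run exactly once. The codelist and subset labels are placeholders (the abstract names only three of the seven codelists and does not name the subsets); the model names are taken from the abstract.

```python
from itertools import product

# Placeholder labels: the abstract does not list all seven codelists
# or name the two database subsets.
codelists = [f"codelist_{i}" for i in range(1, 8)]
db_subsets = ["subset_a", "subset_b"]

# Model names as given in the abstract.
models = ["Gemini 3 Pro", "Claude Sonnet 4.6", "Gemini 3 Flash", "OpenAI GPT 5.2"]
epochs = [1, 2, 3]

# Assumption: one evaluation per (codelist, subset, model, epoch) combination.
runs = list(product(codelists, db_subsets, models, epochs))
print(len(runs))
```

With 7 × 2 × 4 × 3 combinations, the grid contains the 168 evaluations stated in the Materials & Methods.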

13

Digital Health and Data Utilisation for Improved Primary Health Services Delivery: Multi-Site Perspectives from Quality Improvement Teams in Council Hospitals in Tanzania

Matimo, C. R.; Kacholi, G.; Mollel, H. A.

2026-04-17 health systems and quality improvement 10.64898/2026.04.10.26350674 medRxiv

Top 0.3%

7.0%

Background: Digital health plays an indispensable role in facilitating data analysis and use for enhancing healthcare delivery across health settings. However, there is scant information on the extent to which digital health influences the improvement of primary health services delivery through data use. This study examined the determinants that influence the use of digital health to improve health service delivery in council hospitals in Tanzania. Methods: A cross-sectional design was employed in six regions, involving 12 council hospitals. We used a self-administered questionnaire to collect data from 203 members of hospital quality improvement teams. Descriptive analysis was used to determine the frequency, proportion, and mean of responses, while bootstrapping analysis was conducted to test the statistically significant influence of digital health factors on data use for improving health service delivery. Results: The results show moderate agreement on data compatibility for planning and decision-making, with 40.4% of respondents agreeing it supports ordering commodities, 43.8% for staff allocation, and 38.4% for planning. However, dissatisfaction was higher for user-friendliness (47.8%), reliability (up to 65.5%), and usefulness (up to 63.5%). Overall, 50.2% (M=2.74±0.87) disagreed that digital systems effectively support data use. Structural model analysis confirmed significant positive influence of usefulness (β=0.199, p<0.001) and access to quality data (β=0.729, p<0.001) on data use, which strongly impacted service delivery (β=0.593, p<0.001), despite some factors showing no direct influence. Conclusion: The study finds that current digital health initiatives only modestly improve the user-friendliness, reliability, and usefulness of data systems, partly due to fragmented, non-interoperable platforms that burden data management.
However, compatibility, usability, reliability, and usefulness of digital tools significantly enhance access to quality data and data-driven decisions. The study recommends strengthening and integrating existing systems and providing continuous digital health training to institutionalize data-informed decision-making.

14

PRAM: Post-hoc Retrieval Augmentation for Parameter-Free Domain Adaptation of ICU Clinical Prediction Models

Jeong, I.; Lee, T.; Kim, B.; Park, J.-H.; Kim, Y.; Lee, H.

2026-04-05 health systems and quality improvement 10.64898/2026.04.03.26350132 medRxiv

Top 0.4%

6.9%

Background: Clinical prediction models degrade when deployed across hospitals, yet retraining requires technical expertise, labeled data, and regulatory re-approval. We investigated whether post-hoc retrieval augmentation of a frozen model's output, analogous to retrieval-augmented methods in natural language processing, can mitigate this degradation without any parameter modification. Methods: We developed the Post-hoc Retrieval Augmentation Module (PRAM), which combines predictions from a frozen base model with outcome information retrieved from similar patients in a local patient bank. Five base models (logistic regression through CatBoost) and three retrieval strategies were evaluated on 116,010 ICU patients across three databases (MIMIC-IV, MIMIC-III, eICU-CRD) for acute kidney injury (AKI) and mortality prediction. A bank size deployment simulation modeled performance from zero to full local data accumulation, complemented by source bank cold start, stress tests, and calibration experiments. Model performance was evaluated using the area under the receiver operating characteristic curve (AUROC). Results: Retrieval benefit was inversely associated with base model complexity (ρ = -0.90 for AKI, -1.00 for mortality): simpler models benefited more, consistent with retrieval capturing residual signal unexploited by the base model. PRAM showed a statistically significant monotone dose-response between bank size and prediction performance across all six outcome-target combinations (Kendall τ trend test, q = 0.031 for all). At the pre-specified primary comparison (bank = 5,000), the improvement was confirmed for the two largest-shift settings (eICU-CRD AKI: ΔAUROC = +0.012, q < 0.001; eICU-CRD mortality: ΔAUROC = +0.026, q < 0.001). Pre-loading a source bank bridged the cold-start gap, providing an immediate performance gain equivalent to approximately 2,000 to 5,000 local patients. 
Conclusions: PRAM provides a parameter-free adaptation mechanism that requires no model retraining, gradient computation, or regulatory re-evaluation at the deployment site. Effect sizes were modest and did not reach cross-model superiority, but the consistent dose-response pattern and the absence of retraining requirements establish retrieval-based adaptation as a viable approach for clinical model transportability. The retrieval mechanism additionally opens a pathway toward case-based interpretability, where predictions are accompanied by identifiable similar patients from the deploying institution.
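
As a rough illustration of the idea in this abstract, the sketch below blends a frozen model's predicted probability with the outcome rate among the nearest patients in a local bank. The abstract does not specify the retrieval strategies or the combination rule, so the Euclidean k-nearest-neighbour lookup, the fixed blending `weight`, and all function names are assumptions rather than the authors' implementation.

```python
import math


def retrieved_outcome_rate(query, bank_features, bank_outcomes, k=3):
    """Mean outcome among the k bank patients closest to `query`.

    Assumption: similarity is plain Euclidean distance on feature tuples.
    Returns None when the bank is empty (nothing to retrieve).
    """
    if not bank_features:
        return None
    scored = sorted(
        ((math.dist(query, feats), outcome)
         for feats, outcome in zip(bank_features, bank_outcomes)),
        key=lambda pair: pair[0],
    )
    nearest = scored[:k]
    return sum(outcome for _, outcome in nearest) / len(nearest)


def blend_with_retrieval(base_prob, query, bank_features, bank_outcomes,
                         k=3, weight=0.5):
    """Combine a frozen model's probability with the retrieved outcome rate.

    Assumption: a convex blend with a fixed weight. The base model is never
    touched, so no parameters are modified.
    """
    rate = retrieved_outcome_rate(query, bank_features, bank_outcomes, k)
    if rate is None:
        return base_prob  # empty bank: fall back to the frozen model
    return (1.0 - weight) * base_prob + weight * rate


# Hypothetical toy bank: two nearby patients with the outcome, two far without.
bank_x = [(0.0, 0.0), (0.0, 1.0), (5.0, 5.0), (6.0, 5.0)]
bank_y = [1, 1, 0, 0]
adjusted = blend_with_retrieval(0.2, (0.0, 0.5), bank_x, bank_y, k=2)
```

With an empty bank the function returns the base prediction unchanged, loosely mirroring the zero-data end of the bank-size simulation described above.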

15

DR. INFO at the Point of Care: A Prospective Pilot Study of an Agentic AI Clinical Assistant

Corga Da Silva, R.; Romano, M.; Mendes, T.; Isidoro, M.; Ravichandran, S.; Kumar, S.; van der Heijden, M.; Fail, O.; Gnanapragasam, V. E.

2026-04-01 health informatics 10.64898/2026.03.31.26349817 medRxiv

Top 0.4%

6.9%

Background: Clinical documentation and information retrieval consume over half of physicians' working hours, contributing to cognitive overload and burnout. While artificial intelligence offers a potential solution, concerns over hallucinations and source reliability have limited adoption at the point of care. Objective: To evaluate clinician-reported time savings, decision-making support, and satisfaction with DR. INFO, an agentic AI clinical assistant, in routine clinical practice. Methods: In this prospective, single-arm pilot study, 29 clinicians across multiple specialties in Portuguese healthcare institutions used DR. INFO v1.0 over five working days within a two-week period. Outcomes were assessed via daily Likert-scale evaluations and a final Net Promoter Score (NPS). Non-parametric methods were used throughout. Results: Clinicians reported high perceived time savings (mean 4.27/5; 95% CI: 3.97-4.57) and decision support (4.16/5; 95% CI: 3.86-4.45), with ratings stable across all study days and no evidence of attrition bias. The NPS was 81.2, with no detractors. Conclusions: Clinicians across specialties and career stages reported sustained satisfaction with DR. INFO for both time efficiency and clinical decision support. Validation in larger, controlled studies with objective outcome measures is warranted. Keywords: Medical AI assistant, LLMs in healthcare, Agentic AI, Clinical decision support, Point of care AI
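
For readers unfamiliar with the Net Promoter Score reported above, it is conventionally computed from 0-10 ratings as the percentage of promoters (9-10) minus the percentage of detractors (0-6). The sketch below assumes the study used that conventional definition; the function name and the example ratings are invented for illustration and are not study data.

```python
def net_promoter_score(ratings):
    """Return the NPS (range -100 to 100) from a list of 0-10 ratings."""
    if not ratings:
        raise ValueError("at least one rating is required")
    promoters = sum(1 for r in ratings if r >= 9)   # ratings of 9 or 10
    detractors = sum(1 for r in ratings if r <= 6)  # ratings of 0 through 6
    return 100.0 * (promoters - detractors) / len(ratings)


# Hypothetical ratings: six promoters, two passives, no detractors.
example_ratings = [10, 9, 9, 8, 10, 7, 9, 10]
example_nps = net_promoter_score(example_ratings)
```

Under this definition, a score with no detractors equals the share of promoters among respondents.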

16

Evaluating a Multitask AI Model versus Humans for Portion Size Estimation

Nurmanova, B.; Omarova, Z.; Sanatbyek, A.; Varol, H. A.; Chan, M.-Y.

2026-04-18 nutrition 10.64898/2026.04.16.26351036 medRxiv

Top 0.4%

6.9%

Background: Accurate dietary assessment is essential for precision nutrition and effective nutrition surveillance. However, portion size estimation remains a persistent challenge, particularly in culturally diverse regions such as Central Asia. Traditional self-reporting tools often yield inconsistent results due to communal eating practices and unfamiliarity with standard measures. Objective: To address these limitations, this study aimed to compare three methods: unassisted human judgment, visual food atlas assistance, and an artificial intelligence (AI) model, using Central Asian food items. Methods: In this cross-sectional study, 128 participants from Astana, Kazakhstan, visually estimated portion sizes of 51 foods and 8 beverages from standardized photographs. Participants were randomized into two groups: one using unassisted visual estimation and the other aided by a regionally tailored digital food atlas. Additionally, an AI model trained on Central Asian food images was evaluated. Actual food weights served as the reference standard. Accuracy was assessed using Mean Absolute Error (MAE) and Mean Absolute Percentage Error (MAPE) across food types and portion sizes. Results: The atlas-assisted group demonstrated the highest accuracy, with the lowest MAE (80.81g) and MAPE (44.76%) across all portions. The AI model showed promising results for average portions (MAE: 79.07g, MAPE: 67.91%) but underperformed on small portions, particularly for meat-based items. Unassisted estimates were the least accurate (MAE: 133.86g, MAPE: 79.40%). Across food categories, visual aids consistently improved accuracy, while AI demonstrated variability by texture and portion size. Conclusions: Culturally adapted visual atlases significantly enhance portion size estimation accuracy in non-Western, communal-eating contexts. 
While AI models hold promise for dietary assessments, particularly with standard portions and beverages, further refinement is needed for complex food items and small portion types. These findings support the integration of visual and AI-based tools into region-specific dietary monitoring strategies.
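
The two accuracy metrics named in this abstract can be sketched as follows, using their standard definitions. The example weights are hypothetical and chosen only to show that a large miss on a small portion inflates MAPE far more than MAE, which is one plausible reading (not a finding of the study) of why a method's MAE and MAPE can diverge.

```python
def mean_absolute_error(actual, estimated):
    """Average absolute difference, in the same unit as the inputs (grams)."""
    return sum(abs(a - e) for a, e in zip(actual, estimated)) / len(actual)


def mean_absolute_percentage_error(actual, estimated):
    """Average absolute error relative to the true value, in percent.

    Assumes every actual value is non-zero.
    """
    ratios = (abs(a - e) / a for a, e in zip(actual, estimated))
    return 100.0 * sum(ratios) / len(actual)


# Hypothetical items: a small portion overestimated by 50 g and a large
# portion underestimated by 10 g.
actual_g = [50.0, 400.0]
estimated_g = [100.0, 390.0]

mae_g = mean_absolute_error(actual_g, estimated_g)
mape_pct = mean_absolute_percentage_error(actual_g, estimated_g)
```

Here the absolute errors average to 30 g, while the percentage errors (100% and 2.5%) average to just over 51%.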

17

BSO-AD: An Ontology for Representing and Harmonizing Behavioral Social Knowledge in ADRD

Li, H.; Yu, Y.; Bhandarkar, A.; Kumar, R.; Clark, I. H.; Hu, Y.; Cao, W.; Zhao, N.; LI, F.; Tao, C.

2026-03-31 health informatics 10.64898/2026.03.30.26349756 medRxiv

Top 0.4%

6.7%

Objective: Behavioral and social factors (BSFs) substantially influence the risk, onset, and progression of Alzheimer disease and related dementias (ADRD). A systematic representation of their interplay is essential for advancing prevention and targeted interventions. However, BSF-related knowledge is scattered across heterogeneous sources, limiting scalable evidence synthesis and computational analysis. To address this, we created a Behavioral Social Data and Knowledge Ontology for ADRD (BSO-AD) to represent and integrate BSFs with respect to ADRD. Material and Methods: BSO-AD was developed following established ontology design principles, prioritizing reuse of existing ontology elements to ensure semantic interoperability. It was built upon the Social Determinants of Health Ontology (SDoHO) and the Drug-Repurposing Oriented Alzheimer Disease Ontology (DROADO). BSF-related classes were enriched with ICD-10-CM Z55-Z65 codes and ADRD-related classes with AD Onto. Relationships between BSFs and ADRD were derived through literature mining. Ontology quality was evaluated through Hootation-based expert review and an LLM-assisted framework assessing structural coverage and semantic coherence. Results: BSO-AD contains 2275 classes, 153 object properties, and 49 data properties. Expert review demonstrated strong rational agreement (0.95), with disagreements resolved through discussion. LLM-based evaluation showed high category coverage rates (≥ 0.97) and robust semantic alignment with the relevant literature (average completeness = 0.79; conciseness = 0.94). Discussion and Conclusion: BSO-AD is, to our knowledge, the first ontology to systematically represent BSFs and hierarchically model their interrelationships in ADRD. It establishes a semantic backbone for computational analysis and knowledge integration. The LLM-assisted evaluation framework demonstrates the feasibility of scalable, automated ontology assessment.

18

Comparing prognostic performance and reasoning between large language models and physicians

Gjertsen, M.; Yoon, W.; Afshar, M.; Temte, B.; Leding, B.; Halliday, S.; Bradley, K.; Kim, J.; Mitchell, J.; Sanders, A. K.; Croxford, E. L.; Caskey, J.; Churpek, M. M.; Mayampurath, A.; Gao, Y.; Miller, T.; Kruser, J. M.

2026-04-25 intensive care and critical care medicine 10.64898/2026.04.17.26350898 medRxiv

Top 0.4%

6.7%

Importance: Physicians routinely prognosticate to guide care delivery and shared decision making, particularly when caring for patients with critical illnesses. Yet, these physician estimates are prone to inaccuracy and uncertainty. Artificial intelligence, including large language models (LLMs), shows promise in supporting or improving this prognostication. However, the performance of contemporary LLMs in prognosticating for the heterogeneous population of critically ill patients remains poorly understood. Objective: To characterize and compare the performance of LLMs and physicians when predicting 6-month mortality for hospitalized adults who survived critical illness. Design: Embedded mixed methods study with elicitation and comparison of prognostic estimates and reasoning from LLMs and practicing physicians. Setting: The publicly available, deidentified Medical Information Mart for Intensive Care (MIMIC)-IV v2.2 dataset. Participants: We randomly selected 100 hospitalizations of adult survivors of critical illness. Four contemporary LLMs (OpenAI GPT-4o, o3- and o4-mini, and DeepSeek-R1) and 7 physicians provided independent prognostic estimates for each case (1,100 total estimates; 400 LLM and 700 physician). Main outcomes and measures: For each case, LLMs and physicians used the hospital discharge summary and demographics to predict 6-month mortality (yes/no) and provide their reasoning (free text). We assessed prognostic performance using accuracy, sensitivity, and specificity, and used inductive, qualitative content analysis to characterize reasoning. Results: Mean physician accuracy for predicting mortality was 70.1% (95% CI 63.7-76.4%), with sensitivity of 59.7% (95% CI 50.6-68.8%) and specificity of 80.6% (95% CI 71.7-88.2%). The top-performing LLM (OpenAI o4-mini) accuracy was 78.0% (95% CI 70.0-86.0%), with sensitivity of 80.0% (95% CI 67.4-90.2%) and specificity of 76.0% (95% CI 63.3-88.0%). 
The difference between mean physician and top-performing LLM accuracy was not statistically significant (p = 0.5). Qualitative analysis revealed similar patterns in LLM and physician expressed reasoning, except that physicians regularly and explicitly reported uncertainty while LLMs did not. Conclusion and Relevance: In this study, LLMs and physicians achieved comparable, moderate performance in predicting 6-month mortality after critical illness, with similar patterns in expressed reasoning. Our findings suggest LLMs could be used to support prognostication in clinical practice but also raise safety concerns due to the lack of LLM uncertainty expression.
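
The three performance measures used in this abstract follow from a 2x2 confusion table of predicted versus observed mortality. The sketch below uses their standard definitions; the function name and the eight example cases are hypothetical and are not drawn from the study.

```python
def binary_metrics(y_true, y_pred):
    """Accuracy, sensitivity, and specificity for 0/1 labels.

    Assumes both classes occur in y_true, so neither denominator is zero.
    """
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # deaths predicted as deaths
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # deaths missed
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # survivors predicted as survivors
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # survivors predicted as deaths
    return {
        "accuracy": (tp + tn) / len(pairs),
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
    }


# Hypothetical cases: 1 = died within 6 months, 0 = survived.
observed = [1, 1, 1, 0, 0, 0, 0, 1]
predicted = [1, 0, 1, 0, 0, 1, 0, 1]
metrics = binary_metrics(observed, predicted)
```

Reporting sensitivity and specificity alongside accuracy shows whether errors fall mainly on missed deaths or on false alarms, a distinction that accuracy alone hides.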

19

Development and Pilot Validation of ABHA-O-SHINE: An AI-Ready Oral Health Risk and Insurance Prediction Framework within the Ayushman Bharat Digital Ecosystem

Saxena, Y.; SHRIVASTAVA, L.

2026-04-01 public and global health 10.64898/2026.03.31.26349846 medRxiv

Top 0.4%

6.5%

Background: Oral health remains inadequately integrated within the Ayushman Bharat Digital Mission (ABDM), particularly in terms of structured risk assessment and its linkage to insurance-based decision-making. There is a growing need for scalable models that can connect clinical oral health data with digital health systems and support future artificial intelligence (AI)-driven applications. Aim: To develop and pilot test the ABHA-O-SHINE framework for oral health risk prediction and insurance prioritization, with a future scope for AI integration within the Ayushman Bharat Health Account (ABHA) ecosystem. Materials and Methods: A cross-sectional pilot study was conducted among 126 participants attending the outpatient department of Swargiya Dadasaheb Kalmegh Smruti Dental College and Hospital, Nagpur. Participants were selected based on predefined inclusion and exclusion criteria. Data collection included a structured questionnaire and clinical examination using the WHO Oral Health Assessment Form (2013). A composite risk score (0 to 14) was developed incorporating behavioral and clinical parameters. Participants were categorized into low, moderate, and high-risk groups, and corresponding insurance priority levels were assigned. Statistical analysis included descriptive statistics, Chi-square test, Spearman correlation, and binary logistic regression. Results: The majority of participants were categorized under moderate to high-risk groups. Tobacco use showed a statistically significant association with higher risk levels (p < 0.05). Positive correlations were observed between total risk score and clinical indicators such as DMFT and CPI. Logistic regression analysis identified tobacco use and clinical scores as significant predictors of high-risk categorization. Conclusion: The ABHA-O-SHINE framework demonstrates feasibility in integrating oral health risk assessment with an insurance prioritization model. 
The framework is designed to be AI-compatible, enabling future automation through machine learning and image-based analysis within the ABDM ecosystem. Keywords: ABHA, ABDM, Oral Health, Risk Assessment, Insurance, Artificial Intelligence.

20

Vision Language Model for Coronary Angiogram Analysis and Report Generation: Development and Evaluation Study

Jiang, Q.; Ke, Y.; Sinisterra, L. G.; Elangovan, K.; Li, Z.; Yeo, K. K.; Jonathan, Y.; Ting, D. S. W.

2026-04-21 cardiovascular medicine 10.64898/2026.04.19.26351241 medRxiv

Top 0.4%

6.4%

Coronary artery disease is a leading cause of morbidity and mortality. Invasive coronary angiography is currently the gold standard in disease diagnosis. Several studies have attempted to use artificial intelligence (AI) to automate its interpretation with varying levels of success. However, most existing studies cannot generate detailed angiographic reports beyond simple classification or segmentation. This study aims to fine-tune and evaluate the performance of a Vision-Language Model (VLM) in coronary angiogram interpretation and report generation. Using 20,000 angiogram keyframes from 1,987 patients collated across four unique datasets, we fine-tuned the InternVL2-4B model with Low-Rank Adaptor weights to perform stenosis detection, anatomy labelling, and report generation. The fine-tuned VLM achieved a precision of 0.56, recall of 0.64, and F1-score of 0.60 for stenosis detection. In anatomy segmentation, it attained a weighted precision of 0.50, recall of 0.43, and F1-score of 0.46, with higher scores in major vessel segments. Report generation integrating multiple angiographic projection views yielded an accuracy of 0.42, negative predictive value of 0.58 and specificity of 0.52. This study demonstrates the potential of using a VLM to streamline angiogram interpretation to rapidly provide actionable information to guide management, support care in resource-limited settings, and audit the appropriateness of coronary interventions. AUTHOR SUMMARY: Coronary artery disease carries a heavy disease burden worldwide, and the coronary angiogram is the gold-standard imaging modality for its diagnosis. Interpreting these complex images and producing clinical reports require significant expertise and time. In this study, we fine-tuned and investigated an open-source VLM, InternVL2-4B, to interpret and report coronary angiogram images in key tasks including stenosis detection, anatomy identification, as well as full report generation. 
We also benchmarked the fine-tuned InternVL2-4B against a state-of-the-art segmentation model, YOLOv8x, which was evaluated on the same test sets. We examined how machine learning metrics like the intersection over union score may not fully capture the clinical accuracy of model predictions and discussed the limitations of relying solely on these metrics for evaluating clinical AI systems. Although the model has not yet achieved expert-level interpretation, our results demonstrate the potential and feasibility of automating the reporting of coronary angiograms. Such systems could potentially assist cardiologists by improving reporting efficiency, highlighting lesions that may require review, and enabling automated calculations of clinical scores such as the SYNTAX score.
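
The F1-score quoted for stenosis detection is the harmonic mean of precision and recall, so the reported figures can be checked against each other. The sketch below uses the standard formula; the function name is illustrative. The same check is not applied to the anatomy figures because a weighted F1-score is averaged per class and need not equal the harmonic mean of the weighted precision and recall.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Precision and recall reported in the abstract for stenosis detection.
stenosis_f1 = f1_score(0.56, 0.64)
stenosis_f1_rounded = round(stenosis_f1, 2)
```

Rounded to two decimals, the result agrees with the F1-score of 0.60 stated in the abstract.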